I <3 Kernels


In this post I’m going to go through the kernel trick and how it helps or enables various tools in statistics and machine learning, including support vector machines, Gaussian processes, kernel regression and kernel PCA. This is going to be a bit of a long one; I’ll probably split it up later, but for now … sorry?

Table of Contents

General Resources

Intro

The goal of this blog post is to make you, the reader, aware of (or more appreciative of) kernel-based methods. For that I’m going to structure the post as: 1. some linear-ish methods that show promise for being even better in non-linear contexts but seem computationally expensive, 2. how the kernel trick allows us to get around these computational bottlenecks, and finally 3. the final form of the kernel-based methods, including those that simply don’t work without it (GPs).

As much as possible I’m not going to presume prior knowledge of these methods, but I am going to do a bit of a whirlwind tour. So if any of them seem interesting to you and the level of detail I provide isn’t enough, I’ve tried to include some independent resources for each sub-section (the ‘Resources’ sections) that should give another perspective or more detail.

And, as should be stated in all of my posts (but to be clear isn’t), I use notation that I understand or that simply keeps things consistent throughout a given post, so it will likely differ from the standard notation for a given topic. If you think I should use different notation for a given idea or object, either to make it clearer or because mine is simply incorrect, please email me at lc[LastNamelowerCase]@[googleAddress].com or [FirstName].[Lastname]@[my institution].edu

Linear Methods

Fisher Linear Discriminant Analysis (KDA I)

Resources

The Gist

Fisher Linear Discriminant Analysis, or simply FLDA1, is a supervised method (meaning we know the class labels) for finding a linear direction to project the data onto that separates two or more classes of objects (the corresponding decision boundary being a single threshold value in 1D, a line in 2D, a plane in 3D). Here we will focus on the separation of just two classes.

The key idea behind FLDA is that you construct some linear combination of the input variables, project the objects onto it, and use that projection to demarcate the two classes. You choose the combination by 1. maximising the distance (or variance) between the two groups in the projected space and 2. minimising the variance of each group within this space. Below are some examples of this in action, before we get into it, to emphasise that both conditions must be satisfied to get good discrimination.

Nothing to see here.

In a very non-statistician way I’m just going to throw the formula here and leave the derivation to another day (as I will do for quite a few things in this post).

\[\begin{align} Z = \frac{\sigma^2_{\textrm{between}}}{\sigma^2_{\textrm{within}}} = \frac{(\vec{w}\cdot(\vec{\mu}_1 - \vec{\mu}_0))^2}{\vec{w}^T\left(\Sigma_0 + \Sigma_1\right)\vec{w}} \end{align}\]

Here \(\vec{w}\) is a vector in the direction of the line (or, for generality, a linear operator on the variables) that is used above as a projection operator, so that the statistical measures are taken on the line. There is an analytical solution to this, where

\[\begin{align} \vec{w} \propto (\Sigma_0 + \Sigma_1)^{-1}(\vec{\mu}_1 - \vec{\mu}_0). \end{align}\]

The solution is only defined up to proportionality, since increasing or decreasing the magnitude of the direction vector still returns the same line. For the sake of a cool gif, and for later on when we generalise this method, let’s look at what happens when you try to optimise for \(\vec{w}\) numerically and compare it to the exact solution above (a rough sketch of such an optimisation loop follows after the gif).
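To make the closed form above concrete, here is a minimal NumPy sketch; the synthetic two-class data, the random seed and the variable names are all my own illustrative choices rather than anything canonical.

```python
# Minimal sketch: closed-form FLDA direction, w ∝ (Σ0 + Σ1)^{-1} (μ1 - μ0),
# on made-up 2D data. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes with different means and covariances
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 0.5]], size=200)
X1 = rng.multivariate_normal([2.0, 1.5], [[0.8, -0.2], [-0.2, 0.6]], size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sigma0, Sigma1 = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)

# np.linalg.solve avoids forming an explicit inverse
w = np.linalg.solve(Sigma0 + Sigma1, mu1 - mu0)
w /= np.linalg.norm(w)  # only the direction matters, so normalise for convenience

print("closed-form FLDA direction:", w)
```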


Let’s compare the optimisation result to my “guesses” above.

GIF Showing Progression of LDA Optimisation



It's 10pm, I'm not coming up with a caption
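
For completeness, here is a rough sketch of what such an optimisation loop might look like; this is not the actual code behind the gif, and it carries on from the earlier snippet, reusing mu0, mu1, Sigma0, Sigma1 and w.

```python
# Rough sketch: maximise the Fisher criterion Z(v) directly by (projected)
# gradient ascent and compare the result to the closed-form direction.
import numpy as np

def fisher_criterion(v, mu0, mu1, Sigma0, Sigma1):
    """Z(v) = (v·(μ1 - μ0))^2 / (v^T (Σ0 + Σ1) v)."""
    num = (v @ (mu1 - mu0)) ** 2
    den = v @ (Sigma0 + Sigma1) @ v
    return num / den

def numerical_grad(f, v, eps=1e-6):
    """Central finite differences; crude, but fine for a 2D toy problem."""
    g = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        g[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return g

v = np.array([1.0, 0.0])  # arbitrary starting direction
for _ in range(500):
    grad = numerical_grad(lambda u: fisher_criterion(u, mu0, mu1, Sigma0, Sigma1), v)
    v = v + 0.05 * grad
    v /= np.linalg.norm(v)  # only the direction matters, so keep unit length

# The optimised direction should line up (up to sign) with the closed-form w
print("optimised:", v, "closed form:", w)
```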



Support Vector Machines (SVM I)

Resources

Ridge Regression (KRR I)

Resources

Principal Component Analysis (Kernel PCA I)

Resources

The Kernel Trick

Resources

Awesome Kernel-Based Methods

Kernel Discriminant Analysis (KDA II)

Resources

Support Vector Machines (SVM II)

Resources

Kernel Ridge Regression (KRR II)

Resources

Kernel Principal Component Analysis (Kernel PCA II)

Resources

Gallery of Kernels

Resources

Gaussian Processes

Resources

Conclusions / Pros and Cons of Kernel Methods

  1. It annoys me to no end that Fisher Linear Discriminant Analysis and Linear Discriminant Analysis are commonly used interchangeably. Strictly, “Linear Discriminant Analysis” assumes homoscedasticity (same covariances) between the two groups and that they follow normal distributions. It is for this reason that I gave up on finding a probabilistic derivation of FLDA, and I ain’t spending the time deriving it myself. Kernel Discriminant Analysis, as far as I can see, is based on Fisher LDA, hence I focus on that.